Results 1 - 20 of 678
1.
PLoS Biol ; 20(2): e3001536, 2022 02.
Article in English | MEDLINE | ID: mdl-35167588

ABSTRACT

The importance of sampling from globally representative populations has been well established in human genomics. In human microbiome research, however, we lack a full understanding of the global distribution of sampling in research studies. This information is crucial to better understand global patterns of microbiome-associated diseases and to extend the health benefits of this research to all populations. Here, we analyze the country of origin of all 444,829 human microbiome samples that are available from the world's 3 largest genomic data repositories, including the Sequence Read Archive (SRA). The samples are from 2,592 studies of 19 body sites, including 220,017 samples of the gut microbiome. We show that more than 71% of samples with a known origin come from Europe, the United States, and Canada, including 46.8% from the US alone, despite the country representing only 4.3% of the global population. We also find that central and southern Asia is the most underrepresented region: Countries such as India, Pakistan, and Bangladesh account for more than a quarter of the world population but make up only 1.8% of human microbiome samples. These results demonstrate a critical need to ensure more global representation of participants in microbiome studies.


Subjects
Gastrointestinal Microbiome/genetics , Genomics/methods , Metagenome/genetics , Metagenomics/methods , Microbiota/genetics , Asia , Bangladesh , Canada , Developed Countries , Europe , Genomics/statistics & numerical data , Geography , Humans , India , Metagenomics/statistics & numerical data , Pakistan , United States
2.
J Comput Biol ; 29(2): 155-168, 2022 02.
Article in English | MEDLINE | ID: mdl-35108101

ABSTRACT

k-mer-based methods are widely used in bioinformatics, but there are many gaps in our understanding of their statistical properties. Here, we consider the simple model where a sequence S (e.g., a genome or a read) undergoes a simple mutation process through which each nucleotide is mutated independently with some probability r, under the assumption that there are no spurious k-mer matches. How does this process affect the k-mers of S? We derive the expectation and variance of the number of mutated k-mers and of the number of islands (a maximal interval of mutated k-mers) and oceans (a maximal interval of nonmutated k-mers). We then derive hypothesis tests and confidence intervals (CIs) for r given an observed number of mutated k-mers, or, alternatively, given the Jaccard similarity (with or without MinHash). We demonstrate the usefulness of our results using a few select applications: obtaining a CI to supplement the Mash distance point estimate, filtering out reads during alignment by Minimap2, and rating long-read alignments to a de Bruijn graph by Jabba.
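As a rough illustration of the mutation model in this abstract: if each nucleotide mutates independently with probability r, a given k-mer stays intact with probability (1 - r)^k, so the expected number of mutated k-mers among L k-mers is L(1 - (1 - r)^k). The sketch below computes that expectation, inverts it into a point estimate of r, and runs a small simulation. The function names are hypothetical, and the hypothesis tests and confidence intervals derived in the paper are not reproduced here.

```python
import random


def expected_mutated_kmers(num_kmers, k, r):
    """Expected count of mutated k-mers when each base mutates independently with probability r."""
    return num_kmers * (1.0 - (1.0 - r) ** k)


def estimate_r(num_mutated, num_kmers, k):
    """Point estimate of r obtained by inverting the expectation above."""
    frac_intact = 1.0 - num_mutated / num_kmers
    return 1.0 - frac_intact ** (1.0 / k)


def simulate_mutated_kmers(seq_len, k, r, rng):
    """Simulate one mutation process and count the k-mers that contain a mutated base."""
    mutated = [rng.random() < r for _ in range(seq_len)]
    # The k-mer starting at position i is mutated if any of its k bases is mutated.
    return sum(any(mutated[i:i + k]) for i in range(seq_len - k + 1))


if __name__ == "__main__":
    rng = random.Random(0)
    seq_len, k, r = 10_000, 21, 0.01
    observed = simulate_mutated_kmers(seq_len, k, r, rng)
    num_kmers = seq_len - k + 1
    print("observed mutated k-mers:", observed)
    print("expected mutated k-mers:", expected_mutated_kmers(num_kmers, k, r))
    print("point estimate of r:", estimate_r(observed, num_kmers, k))
```

A point estimate alone says nothing about uncertainty, which is the gap the confidence intervals described in the abstract are meant to fill.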


Subjects
Mutation , Sequence Analysis, DNA/statistics & numerical data , Algorithms , Base Sequence , Computational Biology , Confidence Intervals , Genomics/statistics & numerical data , Humans , Models, Genetic , Sequence Alignment/statistics & numerical data , Software
4.
J Comput Biol ; 29(1): 19-22, 2022 01.
Article in English | MEDLINE | ID: mdl-34985990

ABSTRACT

Although the availability of various sequencing technologies allows us to capture different genome properties at single-cell resolution, with the exception of a few co-assaying technologies, applying different sequencing assays on the same single cell is impossible. Single-cell alignment using optimal transport (SCOT) is an unsupervised algorithm that addresses this limitation by using optimal transport to align single-cell multiomics data. First, it preserves the local geometry by constructing a k-nearest neighbor (k-NN) graph for each data set (or domain) to capture the intra-domain distances. SCOT then finds a probabilistic coupling matrix that minimizes the discrepancy between the intra-domain distance matrices. Finally, it uses the coupling matrix to project one single-cell data set onto another through barycentric projection, thus aligning them. SCOT requires tuning only two hyperparameters and is robust to the choice of one. Furthermore, the Gromov-Wasserstein distance in the algorithm can guide SCOT's hyperparameter tuning in a fully unsupervised setting when no orthogonal alignment information is available. Thus, SCOT is a fast and accurate alignment method that provides a heuristic for hyperparameter selection in a real-world unsupervised single-cell data alignment scenario. We provide a tutorial for SCOT and make its source code publicly available on GitHub.
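The final step described above, barycentric projection, can be sketched in a few lines. This is a minimal illustration and not SCOT's own code: it assumes a coupling matrix has already been computed by some optimal transport solver, and the function name and list-of-lists data layout are hypothetical. Each source cell is mapped to the coupling-weighted average of the target cells.

```python
def barycentric_projection(coupling, target_points):
    """Project each source sample onto the target domain.

    coupling[i][j] is the transport mass between source sample i and target
    sample j; target_points[j] is the feature vector of target sample j.
    Assumes every row of the coupling matrix has positive total mass.
    """
    dim = len(target_points[0])
    projected = []
    for row in coupling:
        mass = sum(row)
        # Weighted average of target points, with weights given by the coupling row.
        projected.append([
            sum(weight * point[d] for weight, point in zip(row, target_points)) / mass
            for d in range(dim)
        ])
    return projected


if __name__ == "__main__":
    coupling = [[0.5, 0.0], [0.25, 0.25]]
    targets = [[0.0, 0.0], [2.0, 4.0]]
    print(barycentric_projection(coupling, targets))
```

In the example, the first source sample is coupled only to the first target point, while the second is split evenly and lands midway between the two target points.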


Subjects
Algorithms , Sequence Alignment/statistics & numerical data , Single-Cell Analysis/statistics & numerical data , Computational Biology , Databases, Genetic/statistics & numerical data , Genomics/statistics & numerical data , Heuristics , Humans , Neural Networks, Computer , Sequence Analysis/statistics & numerical data , Software , Unsupervised Machine Learning
5.
J Comput Biol ; 29(1): 56-73, 2022 01.
Article in English | MEDLINE | ID: mdl-34986026

ABSTRACT

Over the past decade, a promising line of cancer research has utilized machine learning to mine statistical patterns of mutations in cancer genomes for information. Recent work shows that these statistical patterns, commonly referred to as "mutational signatures," have diverse therapeutic potential as biomarkers for cancer therapies. However, translating this potential into reality is hindered by limited access to sequencing in the clinic. Almost all methods for mutational signature analysis (MSA) rely on whole genome or whole exome sequencing data, while sequencing in the clinic is typically limited to small gene panels. To improve clinical access to MSA, we considered the question of whether targeted panels could be designed for the purpose of mutational signature detection. Here we present ScalpelSig, to our knowledge the first algorithm that automatically designs genomic panels optimized for detection of a given mutational signature. The algorithm learns from data to identify genome regions that are particularly indicative of signature activity. Using a cohort of breast cancer genomes as training data, we show that ScalpelSig panels substantially improve accuracy of signature detection compared to baselines. We find that some ScalpelSig panels even approach the performance of whole exome sequencing, which observes over 10 × as much genomic material. We test our algorithm under a variety of conditions, showing that its performance generalizes to another dataset of breast cancers, to smaller panel sizes, and to lesser amounts of training data.


Subjects
Algorithms , DNA Mutational Analysis/statistics & numerical data , Genomics/statistics & numerical data , Breast Neoplasms/genetics , Cohort Studies , Computational Biology , Databases, Genetic/statistics & numerical data , Female , Humans , Machine Learning , Mutation , Whole Genome Sequencing/statistics & numerical data
6.
J Comput Biol ; 29(1): 3-18, 2022 01.
Article in English | MEDLINE | ID: mdl-35050714

ABSTRACT

Recent advances in sequencing technologies have allowed us to capture various aspects of the genome at single-cell resolution. However, with the exception of a few co-assaying technologies, it is not possible to simultaneously apply different sequencing assays on the same single cell. In this scenario, computational integration of multi-omic measurements is crucial to enable joint analyses. This integration task is particularly challenging due to the lack of sample-wise or feature-wise correspondences. We present single-cell alignment with optimal transport (SCOT), an unsupervised algorithm that uses the Gromov-Wasserstein optimal transport to align single-cell multi-omics data sets. SCOT performs on par with the current state-of-the-art unsupervised alignment methods, is faster, and requires tuning of fewer hyperparameters. More importantly, SCOT uses a self-tuning heuristic to guide hyperparameter selection based on the Gromov-Wasserstein distance. Thus, in the fully unsupervised setting, SCOT aligns single-cell data sets better than the existing methods without requiring any orthogonal correspondence information.


Subjects
Algorithms , Genomics/statistics & numerical data , Sequence Alignment/statistics & numerical data , Single-Cell Analysis/statistics & numerical data , Computational Biology , Computer Simulation , Databases, Genetic/statistics & numerical data , Humans , Models, Statistical , Unsupervised Machine Learning
7.
J Comput Biol ; 29(2): 140-154, 2022 02.
Article in English | MEDLINE | ID: mdl-35049334

ABSTRACT

k-mer counts are important features used by many bioinformatics pipelines. Existing k-mer counting methods focus on optimizing either time or memory usage, producing as output very large count tables that explicitly represent k-mers together with their counts. Storing k-mers is not needed if the set of k-mers is known, making it possible to keep only counters and their association to k-mers. Solutions avoiding explicit representation of k-mers include Minimal Perfect Hash Functions (MPHFs) and Count-Min sketches. We introduce Set-Min sketch, a sketching technique for representing associative maps inspired by Count-Min, and apply it to the problem of representing k-mer count tables. Set-Min is provably more accurate than both Count-Min and Max-Min, an improved variant of Count-Min for static datasets that we define here. We show that Set-Min sketch provides a very low error rate, in terms of both the probability and the size of errors, at the expense of a very moderate memory increase. On the other hand, Set-Min sketches are shown to take up to an order of magnitude less space than MPHF-based solutions, for fully assembled genomes and large k. Space-efficiency of Set-Min in this case takes advantage of the power-law distribution of k-mer counts in genomic datasets.
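For readers unfamiliar with the baseline that this abstract builds on, the following is a minimal sketch of a plain Count-Min sketch applied to k-mer counting. It is not the Set-Min or Max-Min structure from the paper, and the class name, method names, and parameter choices are hypothetical. The key property shown is that a query never underestimates the true count, because every row counter receives at least the item's own increments.

```python
import hashlib


class CountMinSketch:
    """A basic Count-Min sketch: `depth` hash rows of `width` counters each."""

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One salted hash per row, so that rows behave like independent hash functions.
        digest = hashlib.blake2b(
            item.encode(), digest_size=8, salt=row.to_bytes(8, "little")
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.rows[row][self._index(item, row)] += count

    def query(self, item):
        # The minimum over rows is the estimate least inflated by collisions.
        return min(self.rows[row][self._index(item, row)] for row in range(self.depth))


def count_kmers_into(sketch, seq, k):
    """Add every k-mer of seq to the sketch."""
    for i in range(len(seq) - k + 1):
        sketch.add(seq[i:i + k])


if __name__ == "__main__":
    sketch = CountMinSketch(width=64, depth=4)
    count_kmers_into(sketch, "ACGTACGTAC", 4)
    print(sketch.query("ACGT"))
```

Note that the sketch stores only counters, never the k-mers themselves, which is the space-saving idea the abstract refers to.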


Subjects
Computational Biology/methods , Genomics/statistics & numerical data , Software , Algorithms , Animals , Computer Graphics , Databases, Genetic/statistics & numerical data , Genome, Human , Humans , Models, Statistical , Molecular Sequence Annotation/statistics & numerical data
8.
J Comput Biol ; 29(2): 169-187, 2022 02.
Article in English | MEDLINE | ID: mdl-35041495

ABSTRACT

Recently, Gagie et al. proposed a version of the FM-index, called the r-index, that can store thousands of human genomes on a commodity computer. Then Kuhnle et al. showed how to build the r-index efficiently via a technique called prefix-free parsing (PFP) and demonstrated its effectiveness for exact pattern matching. Exact pattern matching can be leveraged to support approximate pattern matching, but the r-index itself cannot efficiently support popular and important queries such as finding maximal exact matches (MEMs). To address this shortcoming, Bannai et al. introduced the concept of thresholds, and showed that storing them together with the r-index enables efficient MEM finding, but they did not say how to find those thresholds. We present a novel algorithm that applies PFP to build the r-index and find the thresholds simultaneously and in linear time and space with respect to the size of the prefix-free parse. Our implementation, called MONI, can rapidly find MEMs between reads and large-sequence collections of highly repetitive sequences. Compared with other read aligners (PuffAligner, Bowtie2, BWA-MEM, and CHIC), MONI used 2-11 times less memory and was 2-32 times faster for index construction. Moreover, MONI was less than one thousandth the size of competing indexes for large collections of human chromosomes. Thus, MONI represents a major advance in our ability to perform MEM finding against very large collections of related references.


Subjects
Algorithms , Genomics/statistics & numerical data , Sequence Alignment/statistics & numerical data , Software , Computational Biology , Databases, Genetic/statistics & numerical data , Genome, Bacterial , Genome, Human , High-Throughput Nucleotide Sequencing/statistics & numerical data , Humans , Salmonella/genetics , Sequence Analysis, DNA/statistics & numerical data , Wavelet Analysis
9.
J Comput Biol ; 29(2): 188-194, 2022 02.
Article in English | MEDLINE | ID: mdl-35041518

ABSTRACT

Efficiently finding maximal exact matches (MEMs) between a sequence read and a database of genomes is a key first step in read alignment. But until recently, it was unknown how to build a data structure in [Formula: see text] space that supports efficient MEM finding, where r is the number of runs in the Burrows-Wheeler Transform. In 2021, Rossi et al. showed how to build a small auxiliary data structure called thresholds in addition to the r-index in [Formula: see text] space. This addition enables efficient MEM finding using the r-index. In this article, we present the tool that implements this solution, which we call MONI. Namely, we give a high-level view of the main components of the data structure and show how the source code can be downloaded, compiled, and used to find MEMs between a set of sequence reads and a set of genomes.
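To make the quantity r in this abstract concrete, the sketch below builds the Burrows-Wheeler Transform of a short string by sorting rotations and counts its runs of equal characters. This naive construction is for illustration only and is far too slow for real genomes; it is not how MONI builds its index. The function names are hypothetical, and the input is assumed not to contain the sentinel character "$".

```python
def bwt(text):
    """Burrows-Wheeler Transform via sorted rotations (naive, for small inputs only)."""
    text = text + "$"  # sentinel that sorts before every letter
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_runs(s):
    """Number of maximal runs of identical characters in s."""
    return sum(1 for i, ch in enumerate(s) if i == 0 or ch != s[i - 1])


if __name__ == "__main__":
    transformed = bwt("banana")
    print(transformed, count_runs(transformed))
```

Repetitive inputs, such as collections of closely related genomes, tend to produce long runs in the transform, which is why an index whose size depends on r rather than on the text length can be so compact.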


Subjects
Algorithms , Sequence Alignment/statistics & numerical data , Software , Computational Biology , Databases, Genetic/statistics & numerical data , Genome, Human , Genomics/statistics & numerical data , Humans , Sequence Analysis, DNA/statistics & numerical data
11.
PLoS Comput Biol ; 17(11): e1009161, 2021 11.
Article in English | MEDLINE | ID: mdl-34762640

ABSTRACT

Network propagation refers to a class of algorithms that integrate information from input data across connected nodes in a given network. These algorithms have wide applications in systems biology, protein function prediction, inferring condition-specifically altered sub-networks, and prioritizing disease genes. Despite the popularity of network propagation, there is a lack of comparative analyses of different algorithms on real data and little guidance on how to select and parameterize the various algorithms. Here, we address this problem by analyzing different combinations of network normalization and propagation methods and by demonstrating schemes for the identification of optimal parameter settings on real proteome and transcriptome data. Our work highlights the risk of a 'topology bias' caused by the incorrect use of network normalization approaches. Capitalizing on the fact that network propagation is a regularization approach, we show that minimizing the bias-variance tradeoff can be utilized for selecting optimal parameters. The application to real multi-omics data demonstrated that optimal parameters could also be obtained by either maximizing the agreement between different omics layers (e.g. proteome and transcriptome) or by maximizing the consistency between biological replicates. Furthermore, we exemplified the utility and robustness of network propagation on multi-omics datasets for identifying ageing-associated genes in brain and liver tissues of rats and for elucidating molecular mechanisms underlying prostate cancer progression. Overall, this work compares different network propagation approaches and it presents strategies for how to use network propagation algorithms to optimally address a specific research question at hand.
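One common member of the algorithm class discussed above is random walk with restart on a degree-normalized network. The sketch below is a minimal pure-Python version under stated assumptions: the adjacency matrix is a symmetric list of lists, normalization is by column sums, and `alpha` (the spreading weight) is a hypothetical parameter name. It does not reproduce the specific normalization schemes or parameter-selection strategies compared in the paper.

```python
def column_normalize(adj):
    """Divide each column of the adjacency matrix by its sum (column-stochastic form)."""
    n = len(adj)
    col_sums = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    return [
        [adj[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
        for i in range(n)
    ]


def propagate(adj, seed_scores, alpha=0.5, tol=1e-10, max_iter=1000):
    """Iterate scores = alpha * W @ scores + (1 - alpha) * seed until convergence."""
    w = column_normalize(adj)
    n = len(adj)
    scores = list(seed_scores)
    for _ in range(max_iter):
        updated = [
            alpha * sum(w[i][j] * scores[j] for j in range(n))
            + (1.0 - alpha) * seed_scores[i]
            for i in range(n)
        ]
        if max(abs(a - b) for a, b in zip(updated, scores)) < tol:
            return updated
        scores = updated
    return scores


if __name__ == "__main__":
    # Path graph 0 - 1 - 2 with all initial signal on node 0.
    adjacency = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
    print(propagate(adjacency, [1.0, 0.0, 0.0]))
```

Smaller values of `alpha` keep scores close to the input data, while larger values smooth them more strongly over the network, which is the regularization trade-off the abstract alludes to.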


Subjects
Algorithms , Computational Biology/methods , Aging/genetics , Aging/metabolism , Animals , Bias , Brain/metabolism , Computational Biology/statistics & numerical data , Data Interpretation, Statistical , Disease Progression , Gene Expression Profiling/statistics & numerical data , Gene Regulatory Networks , Genomics/statistics & numerical data , Humans , Liver/metabolism , Male , Prostatic Neoplasms/etiology , Prostatic Neoplasms/genetics , Prostatic Neoplasms/metabolism , Protein Interaction Maps , Proteomics/statistics & numerical data , RNA, Messenger/genetics , RNA, Messenger/metabolism , Rats , Systems Biology
12.
PLoS Comput Biol ; 17(11): e1009449, 2021 11.
Article in English | MEDLINE | ID: mdl-34780468

ABSTRACT

The cost of sequencing the genome is dropping at a much faster rate compared to assembling and finishing the genome. The use of lightly sampled genomes (genome-skims) could be transformative for genomic ecology, and results using k-mers have shown the advantage of this approach in identification and phylogenetic placement of eukaryotic species. Here, we revisit the basic question of estimating genomic parameters such as genome length, coverage, and repeat structure, focusing specifically on estimating the k-mer repeat spectrum. We show using a mix of theoretical and empirical analysis that there are fundamental limitations to estimating the k-mer spectra due to ill-conditioned systems, and that has implications for other genomic parameters. We get around this problem using a novel constrained optimization approach (Spline Linear Programming), where the constraints are learned empirically. On reads simulated at 1X coverage from 66 genomes, our method, REPeat SPECTra Estimation (RESPECT), had 2.2% error in length estimation compared to 27% error previously achieved. In shotgun sequenced read samples with contaminants, RESPECT length estimates had median error 4%, in contrast to other methods that had median error 80%. Together, the results suggest that low-pass genomic sequencing can yield reliable estimates of the length and repeat content of the genome. The RESPECT software will be publicly available at https://github.com/shahab-sarmashghi/RESPECT.git.
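The k-mer repeat spectrum mentioned above can be illustrated on a fully known sequence: it records how many distinct k-mers occur exactly once, exactly twice, and so on. The sketch below computes it directly by counting, with hypothetical function names. It is not the RESPECT estimator, which has to infer the spectrum from low-coverage reads rather than read it off an assembled sequence.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every k-mer occurring in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def repeat_spectrum(seq, k):
    """Map multiplicity j to the number of distinct k-mers that occur exactly j times."""
    multiplicities = Counter(kmer_counts(seq, k).values())
    return dict(sorted(multiplicities.items()))


if __name__ == "__main__":
    print(repeat_spectrum("ACGTACGTAC", 4))
```

In the example, three distinct 4-mers occur twice and one occurs once, so the spectrum has the entries 1 -> 1 and 2 -> 3.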


Subjects
Algorithms , Genome , Genomics/statistics & numerical data , Repetitive Sequences, Nucleic Acid , Software , Animals , Computational Biology , Computer Simulation , Databases, Genetic/statistics & numerical data , Humans , Invertebrates/classification , Invertebrates/genetics , Least-Squares Analysis , Linear Models , Mammals/classification , Mammals/genetics , Models, Genetic , Phylogeny , Plants/classification , Plants/genetics , Vertebrates/classification , Vertebrates/genetics
13.
Clin Epigenetics ; 13(1): 179, 2021 09 25.
Article in English | MEDLINE | ID: mdl-34563241

ABSTRACT

BACKGROUND: Nasal intestinal-type adenocarcinomas (ITAC) are strongly related to chronic wood dust exposure: The intestinal phenotype relies on CDX2 overexpression but underlying molecular mechanisms remain unknown. Our objectives were to investigate transcriptomic and methylation differences between healthy non-exposed and tumor olfactory cleft mucosae and to compare transcriptomic profiles between non-exposed, wood dust-exposed and ITAC mucosa cells. METHODS: We conducted a prospective monocentric study (NCT0281823) including 16 woodworkers with ITAC, 16 healthy exposed woodworkers and 13 healthy, non-exposed, controls. We compared tumor samples with healthy non-exposed samples, both in transcriptome and in methylome analyses. We also investigated wood dust-induced transcriptome modifications of exposed (without tumor) male woodworkers' samples and of contralateral sides of woodworkers with tumors. We conducted in parallel transcriptome and methylome analysis, and then, the transcriptome analysis was focused on the genes highlighted in methylome analysis. We replicated our results on dataset GSE17433. RESULTS: Several clusters of genes enabled the distinction between healthy and ITAC samples. Transcriptomic and IHC analysis confirmed a constant overexpression of CDX2 in ITAC samples, without any specific DNA methylation profile regarding the CDX2 locus. ITAC woodworkers also exhibited a specific transcriptomic profile in their contralateral (non-tumor) olfactory cleft, different from that of other exposed woodworkers, suggesting that they had a different exposure or a different susceptibility. Two top-loci (CACNA1C/CACNA1C-AS1 and SLC26A10) were identified with a hemimethylated profile, but only CACNA1C appeared to be overexpressed both in transcriptomic analysis and in immunohistochemistry. 
CONCLUSIONS: Several clusters of genes enable the distinction between healthy mucosa and ITAC samples, even in the contralateral nasal fossa, thus paving the way for a simple diagnostic tool for ITAC in male woodworkers. CACNA1C might be considered as a master gene of ITAC and should be further investigated. TRIAL REGISTRATION: NIH ClinicalTrials, NCT0281823, registered May 23, 2016, https://www.clinicaltrials.gov/NCT0281823.


Subjects
Calcium Channels, L-Type/metabolism , Genomics/methods , Intestinal Neoplasms/genetics , Nose Neoplasms/genetics , Adenocarcinoma/epidemiology , Adenocarcinoma/genetics , Aged , Calcium Channels, L-Type/genetics , DNA Methylation/drug effects , Female , Genomics/instrumentation , Genomics/statistics & numerical data , Humans , Intestinal Neoplasms/epidemiology , Male , Middle Aged , Nose Neoplasms/epidemiology , Occupational Exposure/analysis , Wood
14.
PLoS Comput Biol ; 17(8): e1009254, 2021 08.
Article in English | MEDLINE | ID: mdl-34343164

ABSTRACT

Driven by the necessity to survive environmental pathogens, the human immune system has evolved exceptional diversity and plasticity, to which several factors contribute including inheritable structural polymorphism of the underlying genes. Characterizing this variation is challenging due to the complexity of these loci, which contain extensive regions of paralogy, segmental duplication and high copy-number repeats, but recent progress in long-read sequencing and optical mapping techniques suggests this problem may now be tractable. Here we assess this by using long-read sequencing platforms from PacBio and Oxford Nanopore, supplemented with short-read sequencing and Bionano optical mapping, to sequence DNA extracted from CD14+ monocytes and peripheral blood mononuclear cells from a single European individual identified as HV31. We use this data to build a de novo assembly of eight genomic regions encoding four key components of the immune system, namely the human leukocyte antigen, immunoglobulins, T cell receptors, and killer-cell immunoglobulin-like receptors. Validation of our assembly using k-mer based and alignment approaches suggests that it has high accuracy, with estimated base-level error rates below 1 in 10 kb, although we identify a small number of remaining structural errors. We use the assembly to identify heterozygous and homozygous structural variation in comparison to GRCh38. Despite analyzing only a single individual, we find multiple large structural variants affecting core genes at all three immunoglobulin regions and at two of the three T cell receptor regions. Several of these variants are not accurately callable using current algorithms, implying that further methodological improvements are needed. Our results demonstrate that assessing haplotype variation in these regions is possible given sufficiently accurate long-read and associated data. 
Continued reductions in the cost of these technologies will enable application of these methods to larger samples and provide a broader catalogue of germline structural variation at these loci, an important step toward making these regions accessible to large-scale genetic association studies.


Subjects
Genetic Variation , Genome, Human/immunology , Immune System , Algorithms , Computational Biology , DNA Copy Number Variations , Genomics/methods , Genomics/statistics & numerical data , HLA Antigens/genetics , Haplotypes , High-Throughput Nucleotide Sequencing/statistics & numerical data , Humans , Immunogenetic Phenomena , Immunoglobulins/genetics , Receptors, Antigen, T-Cell/genetics , Receptors, KIR/genetics , Sequence Analysis, DNA/statistics & numerical data
15.
PLoS Comput Biol ; 17(8): e1009224, 2021 08.
Article in English | MEDLINE | ID: mdl-34383739

ABSTRACT

Computational integrative analysis has become a significant approach in the data-driven exploration of biological problems. Many integration methods for cancer subtyping have been proposed, but evaluating these methods has become a complicated problem due to the lack of gold standards. Moreover, questions of practical importance remain to be addressed regarding the impact of selecting appropriate data types and combinations on the performance of integrative studies. Here, we constructed three classes of benchmarking datasets of nine cancers in TCGA by considering all the eleven combinations of four multi-omics data types. Using these datasets, we conducted a comprehensive evaluation of ten representative integration methods for cancer subtyping in terms of accuracy measured by combining both clustering accuracy and clinical significance, robustness, and computational efficiency. We subsequently investigated the influence of different omics data on cancer subtyping and the effectiveness of their combinations. Refuting the widely held intuition that incorporating more types of omics data always produces better results, our analyses showed that there are situations where integrating more omics data negatively impacts the performance of integration methods. Our analyses also suggested several effective combinations for most cancers under our studies, which may be of particular interest to researchers in omics data analysis.


Subjects
Computational Biology/methods , Neoplasms/classification , Neoplasms/genetics , Algorithms , Biomarkers, Tumor/genetics , Data Interpretation, Statistical , Databases, Genetic/statistics & numerical data , Deep Learning , Female , Genomics/statistics & numerical data , Humans , Male , Unsupervised Machine Learning
16.
Genome Biol ; 22(1): 208, 2021 07 13.
Article in English | MEDLINE | ID: mdl-34256818

ABSTRACT

One challenge facing omics association studies is the loss of statistical power when adjusting for confounders and multiple testing. The traditional statistical procedure involves fitting a confounder-adjusted regression model for each omics feature, followed by multiple testing correction. Here we show that the traditional procedure is not optimal and present a new approach, 2dFDR, a two-dimensional false discovery rate control procedure, for powerful confounder adjustment in multiple testing. Through extensive evaluation, we demonstrate that 2dFDR is more powerful than the traditional procedure, and in the presence of strong confounding and weak signals, the power improvement could be more than 100%.
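The multiple testing correction step of the traditional procedure described above is often the Benjamini-Hochberg step-up rule, sketched below. That choice is an assumption made for illustration, since the abstract does not name the correction it has in mind, and this is not the 2dFDR procedure itself. The function name and the `alpha` parameter are hypothetical.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a list of booleans marking which hypotheses are rejected at FDR level alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Step-up rule: keep the largest rank whose p-value is under its threshold.
        if pvalues[idx] <= alpha * rank / m:
            cutoff_rank = rank
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


if __name__ == "__main__":
    print(benjamini_hochberg([0.01, 0.04, 0.03, 0.5]))
    print(benjamini_hochberg([0.04, 0.04, 0.04, 0.04]))
```

The second call shows the step-up behavior: no p-value passes the threshold for its own rank until the last one, at which point all four hypotheses are rejected together.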


Subjects
Algorithms , Genome-Wide Association Study , Genomics/statistics & numerical data , Atlases as Topic , Carcinoma, Hepatocellular/genetics , Carcinoma, Hepatocellular/metabolism , DNA Methylation , Datasets as Topic , False Positive Reactions , Gastrointestinal Microbiome/genetics , Genomics/methods , Hepatitis B/genetics , Hepatitis B/metabolism , Hepatitis B virus/genetics , Hepatitis B virus/pathogenicity , Humans , Linear Models , Liver Neoplasms/genetics , Liver Neoplasms/metabolism
17.
PLoS Comput Biol ; 17(7): e1009229, 2021 07.
Article in English | MEDLINE | ID: mdl-34280186

ABSTRACT

Graphs such as de Bruijn graphs and OLC (overlap-layout-consensus) graphs have been widely adopted for the de novo assembly of genomic short reads. This work studies another important problem in the field: how graphs can be used for high-performance compression of large-scale sequencing data. We present a novel graph definition named Hamming-Shifting graph to address this problem. The definition originates from the technological characteristics of next-generation sequencing machines, aiming to link all pairs of distinct reads that have a small Hamming distance or a small shifting offset or both. We compute multiple lexicographically minimal k-mers to index the reads for an efficient search of the weight-lightest edges, and we prove a very high probability of successfully detecting these edges. The resulting graph creates a full mutual reference of the reads to cascade a code-minimized transfer of every child-read for an optimal compression. We conducted compression experiments on the minimum spanning forest of this extremely sparse graph, and achieved 10-30% more file size reduction than the best compression results using existing algorithms. As future work, the separation and connectivity degrees of these giant graphs can be used as economical measurements or protocols for quick quality assessment of wet-lab machines, for sufficiency control of genomic library preparation, and for accurate de novo genome assembly.
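Two ingredients named in this abstract, Hamming distance and indexing reads by a lexicographically minimal k-mer, can be sketched as follows. This is a simplified illustration with hypothetical function names: it uses a single minimal k-mer per read, whereas the paper computes multiple ones, and it does not build the Hamming-Shifting graph or perform any compression.

```python
from collections import defaultdict


def hamming(a, b):
    """Number of mismatching positions between two equal-length reads."""
    if len(a) != len(b):
        raise ValueError("reads must have equal length")
    return sum(x != y for x, y in zip(a, b))


def minimal_kmer(read, k):
    """Lexicographically smallest k-mer contained in the read."""
    return min(read[i:i + k] for i in range(len(read) - k + 1))


def bucket_by_minimal_kmer(reads, k):
    """Group reads sharing a minimal k-mer, as candidates for low-distance pairs."""
    buckets = defaultdict(list)
    for read in reads:
        buckets[minimal_kmer(read, k)].append(read)
    return dict(buckets)


if __name__ == "__main__":
    reads = ["GATTACA", "CATTACA", "GGGGGGG"]
    for key, group in bucket_by_minimal_kmer(reads, 3).items():
        print(key, group)
    print(hamming("GATTACA", "CATTACA"))
```

Only reads that land in the same bucket need to be compared pairwise, which avoids an all-against-all scan. A single index can miss similar reads whose minimal k-mer is disrupted by a mismatch, which is one plausible reason for using several indexes.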


Subjects
Algorithms , Data Compression/methods , Genomics/methods , High-Throughput Nucleotide Sequencing/methods , Animals , Computational Biology , Computer Graphics , Data Compression/statistics & numerical data , Databases, Genetic/statistics & numerical data , Genomics/statistics & numerical data , High-Throughput Nucleotide Sequencing/statistics & numerical data , Humans
18.
J Cancer Res Ther ; 17(2): 477-483, 2021.
Article in English | MEDLINE | ID: mdl-34121695

ABSTRACT

PURPOSE: This study systematically reviews the distribution of racial/ancestral features and their inclusion as covariates in genetic-toxicity association studies following radiation therapy. MATERIALS AND METHODS: Original research studies associating genetic features and normal tissue complications following radiation therapy were identified from PubMed. The distribution of radiogenomic studies was determined by mining the statement of country of origin and racial/ancestral distribution and the inclusion in analyses. Descriptive analyses were performed to determine the distribution of studies across races/ancestries, countries, and continents and the inclusion in analyses. RESULTS: Among 174 studies, only 23 included a population of more than one race/ancestry, and these were predominantly conducted in the United States. Across the continents, studies were performed in Europe (77 studies averaging at 30.6 patients/million population [pt/mil]), North America (46 studies, 20.8 pt/mil), Asia (46 studies, 2.4 pt/mil), South America (3 studies, 0.4 pt/mil), and Oceania (2 studies, 2.1 pt/mil), with none from Africa. All 23 studies with more than one race/ancestry considered race/ancestry as a covariate, and three studies showed race/ancestry to be significantly associated with endpoints. CONCLUSION: Most toxicity-related radiogenomic studies involved a single race/ancestry. Individual Participant Data meta-analyses or multinational studies need to be encouraged.


Subjects
Genetic Predisposition to Disease , Genomics/statistics & numerical data , Neoplasms/radiotherapy , Racial Groups/statistics & numerical data , Radiation Injuries/genetics , Humans , Neoplasms/genetics , Racial Groups/genetics , Radiation Injuries/epidemiology
19.
Comput Math Methods Med ; 2021: 9969751, 2021.
Article in English | MEDLINE | ID: mdl-34122622

ABSTRACT

Genomic islands are related to microbial adaptation and carry different genomic characteristics from the host. Therefore, many methods have been proposed to distinguish genomic islands from the rest of the genome by evaluating sequence composition. Many sequence features have been proposed, but many of them have not been applied to the identification of genomic islands. In this paper, we present a scheme to predict genomic islands using the chi-square test and random forest algorithm. We extract seven kinds of sequence features and select the important features with the chi-square test. All the selected features are then input into the random forest to predict the genomic islands. Three experiments and a comparison show that the proposed method achieves the best performance. This understanding can be useful for designing more powerful methods for genomic island prediction.
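As a rough illustration of chi-square feature selection, the sketch below computes the Pearson chi-square statistic for a 2x2 contingency table relating a binary feature to a binary class label (for example, island versus non-island). How the paper actually discretizes its seven feature types and builds its tables is not stated in the abstract, so the table layout and function name here are assumptions.

```python
def chi_square_2x2(table):
    """Pearson chi-square statistic for a 2x2 table [[a, b], [c, d]].

    Rows are feature present/absent, columns are the two classes.
    Assumes every row and column total is positive.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    total = sum(row_totals)
    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (table[i][j] - expected) ** 2 / expected
    return statistic


if __name__ == "__main__":
    print(chi_square_2x2([[10, 0], [0, 10]]))  # feature perfectly separates the classes
    print(chi_square_2x2([[5, 5], [5, 5]]))    # feature carries no class information
```

Features with larger statistics are more strongly associated with the label and would be ranked first for inclusion in a downstream classifier such as a random forest.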


Subjects
Genomic Islands , Genomics/methods , Algorithms , Chi-Square Distribution , Computational Biology , Databases, Genetic/statistics & numerical data , Genetics, Microbial/methods , Genetics, Microbial/statistics & numerical data , Genome, Bacterial , Genomics/statistics & numerical data , Models, Genetic
20.
PLoS Comput Biol ; 17(6): e1009064, 2021 06.
Article in English | MEDLINE | ID: mdl-34077420

ABSTRACT

Technological advances have enabled us to profile multiple molecular layers at unprecedented single-cell resolution and the available datasets from multiple samples or domains are growing. These datasets, including scRNA-seq data, scATAC-seq data and sc-methylation data, usually have different powers in identifying the unknown cell types through clustering. So, methods that integrate multiple datasets can potentially lead to a better clustering performance. Here we propose coupleCoC+ for the integrative analysis of single-cell genomic data. coupleCoC+ is a transfer learning method based on the information-theoretic co-clustering framework. In coupleCoC+, we utilize the information in one dataset, the source data, to facilitate the analysis of another dataset, the target data. coupleCoC+ uses the linked features in the two datasets for effective knowledge transfer, and it also uses the information of the features in the target data that are unlinked with the source data. In addition, coupleCoC+ matches similar cell types across the source data and the target data. By applying coupleCoC+ to the integrative clustering of mouse cortex scATAC-seq data and scRNA-seq data, mouse and human scRNA-seq data, mouse cortex sc-methylation and scRNA-seq data, and human blood dendritic cells scRNA-seq data from two batches, we demonstrate that coupleCoC+ improves the overall clustering performance and matches the cell subpopulations across multimodal single-cell genomic datasets. coupleCoC+ has fast convergence and it is computationally efficient. The software is available at https://github.com/cuhklinlab/coupleCoC_plus.


Subjects
Genomics/statistics & numerical data , Machine Learning , Software , Animals , Cerebral Cortex/metabolism , Cluster Analysis , Computational Biology , Databases, Nucleic Acid/statistics & numerical data , Dendritic Cells/metabolism , Humans , Information Theory , Mice , RNA, Small Cytoplasmic/genetics , RNA-Seq , Single-Cell Analysis/statistics & numerical data